Understanding how and how far information, behaviors, or pathogens spread in social networks
is an important problem, with implications both for predicting the size of epidemics and
for planning effective interventions. There are, however, two main challenges for inferring spreading
paths in real-world networks. One is the practical difficulty of observing a dynamic process on a 
network, and the other is the typical constraint of only partially observing a network. Using a 
static, structurally realistic social network as a platform for simulations, we juxtapose three distinct 
paths: (1) the stochastic path taken by a simulated spreading process from source to target; (2) 
the topologically shortest path in the fully observed network, and hence the single most likely 
stochastic path, between the two nodes; and (3) the topologically shortest path in a partially 
observed network. In a sampled network, how closely does the partially observed shortest path 
(3) emulate the unobserved spreading path (1)? Although partial observation inflates the length 
of the shortest path, the stochastic nature of the spreading process also frequently derails the 
dynamic path from the shortest path. We find that the partially observed shortest path does not 
necessarily give an inflated estimate of the length of the process path; in fact, partial observation 
may, counterintuitively, make the path seem shorter than it actually is. 

PACS numbers: 89.75.-k, 89.75.Hc, 02.50.Tt 



I. INTRODUCTION 

The small-world property, first empirically discovered 
by Milgram [1] and then revisited by many, perhaps most 
famously by Watts and Strogatz [2], captures the remarkable
idea that we are all connected to each other via very
short paths, typically encompassing only a handful of in- 
termediaries. Path-based network measures, such as di- 
ameter and average path length, are useful elementary 
network characteristics, but exploring paths and path 
lengths is especially important when dealing with processes
on networks that may be able to permeate only up
to a finite depth. This is relevant for a large class of 
general infection processes, such as the propagation of a 
certain behavior, the transmission of a piece of informa- 
tion, or the spread of a pathogen. As a first approxi- 
mation, one might, of course, assume that any of these 
may percolate through entire social networks, and indeed 
the relationship between the three paths discussed in this 
paper also holds in that case. However, it is likely in practice
that the information being transmitted gets altered
along the way; the behavior gets modified as it is imi- 
tated; or the pathogen becomes mutated as it is passed 
on. Consequently, the penetration depth of a given piece 
of information, a given behavior, or a given pathogen is 
often bounded. When this is the case, understanding 
path lengths becomes especially important.

More nuanced accounts of spreading phenomena 
should distinguish between these different variants of 
information, behaviors, or pathogens. When viewed 
from this angle, any given spreading process in itself,
most likely, has a finite (typically stochastic) permeation
depth. But measuring these depths is difficult in practice 
because of the fundamental difficulty in monitoring the 
unfolding of real-world spreading processes. Even when
time-stamped interaction events are available, such as in 
some recent insightful studies utilizing cell phone commu- 
nication data [3, 4], one still does not have actual spread- 
ing data but, instead, needs to assume that something is 
being spread, possibly across multiple ties, and one also 
needs to operationalize this assumption. (We would like 
to point out to the reader that the notion of temporal dis- 
tance, corresponding to the time-ordered shortest path 
between nodes and defined for empirical event sequences 
in [4], is different from the notion of dynamic path lengths
discussed below.) In contrast, unlike the process itself, 
outcomes of a spreading process are often directly ob- 
servable (e.g. symptoms of chicken pox). But even if we 
could observe the outcomes, a key remaining challenge 
in dealing with person-to-person social networks is that 
instead of observing the full network evolve in time, fi- 
nancial and human resources, ethical considerations, and 
methodological issues typically limit us to a sampled, or 
partially observed, network snapshot (an exception is ex- 
perimental networks [5, 6]). 

Transport processes, such as the routing of data on 
the Internet [7], are somewhat different from but re- 
lated to the spreading processes discussed above. In 
contrast to the World Wide Web, which allows for the 
links from each site to be observed, it is not possible to 
directly map the physical connections between Internet 
routers. Instead, these networks are typically sampled 
using traceroute-like methods, which are trees initiated 
from a single source. It has been recently shown both 
empirically [8] and analytically [9, 10] that the resulting 
sampled networks are biased [11]. Notwithstanding the
common assumption that data packets follow shortest 
routes from source to target, it was found that, although 
the undirected shortest paths had a mean length of 11.4, 
the routes had a mean length of 15.6 hops [12] and only 
19.3% of the routes taken were along the shortest paths. 

Here, we focus on the problem of estimating infection 
path lengths for an unobservable stylized infection process
in a partially observed social network. Similar to
degrees of separation, which quantify how far nodes are 
from each other, infection path lengths, also known as 
degrees of influence, quantify how far a given process 
might spread in the network [13]. In the case of contact 
networks, understanding path lengths might enable us to 
estimate the virulence of a pathogen and its nature, e.g., 
how frequently it mutates. In the case of social networks, 
understanding path lengths might enable us to evaluate 
the infectiousness of certain behaviors and experiences, 
such as obesity, depression, voting, and smoking [14, 15]. 
Understanding how far these conditions may be able to 
spread from one person to another has important conse- 
quences for both gauging the overall extent of these "so- 
cial epidemics," as well as for planning the most effective 
interventions. Both goals are of substantial importance 
from the point of view of public policy. 

Since one cannot in practice follow the paths taken 
by an actual infection or spreading process, the shortest 
path connecting the source and target nodes functions as 
a reasonable proxy for the actual path. Indeed, the short- 
est path is the single most likely path connecting a given 
source node to a given target node, since the probability 
for a given path, under some fairly general assumptions, 
decreases exponentially as a function of its length. A 
counterbalancing factor is that the number of paths, or 
path degeneracy, increases as a function of the distance 
between the source and target nodes, and this happens 
in a way that depends delicately on the structure of the 
network. An important consequence is that spreading 
phenomena often do not follow the shortest paths. Still, 
all in all, the shortest path is always our best guess for 
the actual path, given that in a practical setting one does 
not have microscopic spreading data available. 

This results in three different paths to consider (Fig. 1). 
First, there is the stochastic path of length $\ell$ from node
i to node j, followed by the as-yet-unspecified but inherently
unobservable dynamic process; second, there is the
unsampled, potentially observable, but often only partially
observed, shortest path of length $\ell_u$ (subscript u for
"unsampled") between nodes i and j; and, finally, there
is the shortest path in the sampled network of length $\ell_s$
(subscript s for "sampled") from node i to j.

As mentioned above, the relationship between the 
three paths holds whether or not the spreading process 
has a finite permeation depth. However, when this is the 
case, the problem becomes even more relevant, because 
now the properties of the thing that is spreading might 
be related to the length of the actual path it has taken 
through the system. For example, the relative stability
or mutability of pathogens can depend on the properties
of the system through which they are moving. Recently,
genotyping of pathogens has been combined with social
network mapping to identify likely point sources of epidemics,
and infection paths; this work has contrasted
biological and social network approaches to identifying
and quantifying outbreaks [16].

FIG. 1. (Color online) Schematic of a network infection and
sampling process. (A) The full (unobserved) network with
the initially infected node colored (upper left corner). (B)
The shortest path from the source node to the target node
(lower right corner) corresponds to the most likely infection
path in the fully observed network and has $\ell_u = 2$. (C) The
(unobservable) spreading process unfolds in the (unobserved)
network. The actual path taken by the infection is shown
with wavy edges. The target node is reached in three steps,
giving $\ell = 3$. (D) The partially observed network has some
nodes and links missing depending on the sampling parameters.
The shortest path from source to target has length
$\ell_s = 3$, corresponding to the length of the most likely path
taken by the infection. In this case, using the shortest path
length in the fully observed network $\ell_u$ to estimate the actual
path length $\ell$ would result in an underestimate of path length,
whereas using the path in the partially observed network, in
this case, correctly yields $\ell = \ell_s = 3$.

We will explore some of the properties of these three 
distinct paths by using a real-world social network as 
a platform for simulating both the spreading process 
and the subsequent sampling process. We introduce the 
dataset in Section II, and describe the details of our ap- 
proach in Section III. The main results are presented in 
Section IV, and we discuss our findings in Section V.

II. COMMUNITY NETWORKS 

In this Section, we study path lengths for a simple 
spreading process on a static real-world social network 
with sampling. The platform network possesses all the prototypical
features of social networks: a fat-tailed degree
distribution, assortativity by degree, a high level of clus- 
tering, the small-world property, and network communi- 
ties. Our results are therefore expected to hold for social 
(and other) networks with similar characteristics. The 
platform is a communication network constructed from 
72.4 million private one-to-one cell phone calls among 3.4
million individuals in an undisclosed European country 
over a one-month period [17-19]. This allows the com- 
prehensive ascertainment of ties between people who are 
customers of the given cell phone operator, and results 
in a fairly realistic human social network. We keep only 
reciprocated ties, and denote the number of calls placed 
between nodes i and j with $W_{ij} = W_{ji}$, which can be
conceptualized as tie strength. 

Instead of dealing with the entire network, we wish 
to use several non-overlapping samples of the network 
with varying properties (size, density, etc.) by slicing 
it where it most naturally breaks into pieces, which is 
across communities. To that end, we identify the largest 
80 communities [20-24] and use them as our samples. 
To avoid confusion with subsequent node and tie sam- 
pling, we refer to these network samples as community 
networks. We detect network communities using modu- 
larity maximization in its original formulation [25, 26]. 
Modularity, a number lying between -1 and 1 that
measures how well a given partition $\{c_1, c_2, \ldots, c_N\}$ of a
network compartmentalizes its communities, is given by

$$ Q = \frac{1}{2L} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2L} \right] \delta(c_i, c_j), $$   (1)

where the adjacency matrix element $A_{ij}$ denotes the
strength of the tie connecting nodes i and j, $k_i$ is the degree
of node i, L the total weight of the edges (or number
of unweighted edges) in the network, $c_i$ the community
assignment of node i, and $\delta(c_i, c_j)$ is the Kronecker delta
function, which is unity if and only if $c_i = c_j$, otherwise
it is zero. Modularity, in its original formulation, measures
the difference between the total fraction of edges
that fall within groups versus the fraction one would expect
by chance. The common null model, codified by the
$k_i k_j/(2L)$ term, takes degree heterogeneity into account
by preserving the expected degree distribution. High values
of Q indicate network partitions in which more of the
edges fall within groups than expected by chance. While
maximizing modularity is known to be an NP-hard problem
[27], there are numerous computational heuristics
available [20, 21], and our choice is the so-called Louvain
method [28].

FIG. 2. Visualization of one of the 80 community networks
used in this study. The network consists of one-to-one cell
phone calls, and this particular network contains 2130 nodes.
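
As an illustration of the modularity definition, the following minimal Python sketch evaluates Q for a given partition. It is not the authors' code and does not perform the maximization (the Louvain method is not implemented here); the function name `modularity` and the edge-list input format are assumptions made for this example. It uses the equivalent per-community form of the sum, $Q = \sum_c [L_c/L - (K_c/2L)^2]$, where $L_c$ is the tie weight inside community c and $K_c$ the summed degree of its members.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition of an undirected network.

    edges: iterable of (i, j, weight), each tie listed once, no self-loops.
    community: dict mapping node -> community label (c_i in the text).
    """
    strength = defaultdict(float)  # k_i: (weighted) degree of node i
    within = defaultdict(float)    # L_c: tie weight inside community c
    total = 0.0                    # L: total weight of all ties
    for i, j, w in edges:
        strength[i] += w
        strength[j] += w
        total += w
        if community[i] == community[j]:
            within[community[i]] += w

    # K_c: sum of k_i over the members of each community
    comm_strength = defaultdict(float)
    for node, k in strength.items():
        comm_strength[community[node]] += k

    # Per-community form of (1/2L) sum_ij [A_ij - k_i k_j / 2L] delta(c_i, c_j)
    return sum(within[c] / total - (comm_strength[c] / (2 * total)) ** 2
               for c in comm_strength)


# Two triangles joined by a single bridge tie (hypothetical toy network)
toy_edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
             (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
             (2, 3, 1.0)]
toy_partition = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q_toy = modularity(toy_edges, toy_partition)
```

For the toy network, splitting along the bridge gives Q = 5/14 (about 0.357), whereas placing all nodes in a single community gives Q = 0, as the null-model term then cancels the within-community fraction exactly.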



III. SPREADING ON AND SAMPLING OF 
COMMUNITY NETWORKS 

Here, we describe the spreading and sampling pro- 
cesses which are carried out on each of the 80 static 
community networks. We use the canonical Susceptible- 
Infectious (SI) model, in which each node occupies one of 
the two states (S or I) [29]. The stylized spreading pro- 
cess is carried out in the fully observed community net- 
works, and it proceeds as follows. For each community 
network, starting from one initially infected seed node, 
each infected node, per time step, attempts to infect one 



of its neighbors chosen uniformly at random. The length 
of a time step is therefore defined as the shortest possible 
time during which the infection can spread from an infectious
node to a susceptible node. For node j with degree
$k_j$, this selection probability is given by $p_j = 1/k_j$, which
corresponds to an isotropic one-step random walk. We
call this the unweighted selection because the choice of
the neighbor is topological only, meaning that the neighbor
is selected uniformly at random. In contrast, we
also use weighted selection, where a neighbor k of node
j is chosen with probability $p_{jk} = W_{jk} / \sum_m W_{jm}$, where
$W_{jm}$ represents the strength of the tie between nodes j
and m, quantified in terms of call volume as described
above, leading to neighbor selection that is biased towards
stronger ties.

Once the neighbor has been chosen, the infection hap- 
pens with infection probability, which we have fixed at 
0.05. We run each realization of the simulation for 200 
time steps, which is a sufficiently long time, given the 
value of infection probability, to allow for even very long 
paths (of the order of the network diameter) to emerge. We
keep track of every infection path by tabulating the predecessors
(parents) of each newly infected node, and in
case of repeat infections, i.e., when an already infected node is
made infected for the second time, we only keep track of 
the first infection event (hence ignoring complex conta- 
gion processes [5, 30, 31]). 
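
The spreading process described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names `simulate_si` and `dynamic_path_length`, the adjacency-dictionary input, and the synchronous update (nodes infected during a time step start spreading only in the next step) are assumptions made here, since the text does not specify the update order.

```python
import random


def simulate_si(neighbors, seed, p_infect=0.05, steps=200, weights=None, rng=None):
    """Run one SI realization and return the predecessor (parent) of each infected node.

    neighbors: dict node -> list of neighboring nodes.
    weights:   optional dict (j, k) -> tie strength W_jk; if given, the neighbor
               is chosen with probability W_jk / sum_m W_jm (weighted selection),
               otherwise uniformly at random (unweighted selection).
    """
    rng = rng or random.Random()
    parent = {seed: None}  # the seed node has no predecessor
    for _ in range(steps):
        newly_infected = {}
        for j in list(parent):  # nodes infected at the start of this step
            nbrs = neighbors[j]
            if not nbrs:
                continue
            if weights is None:
                k = rng.choice(nbrs)
            else:
                k = rng.choices(nbrs, weights=[weights[(j, m)] for m in nbrs])[0]
            # Only the first infection event of a node is recorded
            susceptible = k not in parent and k not in newly_infected
            if susceptible and rng.random() < p_infect:
                newly_infected[k] = j
        parent.update(newly_infected)
    return parent


def dynamic_path_length(parent, target):
    """Length of the infection path from the seed to an infected target node."""
    length = 0
    node = target
    while parent[node] is not None:
        node = parent[node]
        length += 1
    return length
```

Following the predecessor table back from a target node to the seed yields the length of the dynamic path for that realization. Nodes that were never infected within the simulated time window do not appear in the table and therefore contribute no path.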

We are interested in the length of the dynamic path 
$\ell$ taken by the infection from the seed node to multiple
target nodes. In particular, we now wish to make inferences
about path lengths, taken by the spreading process
described above, under partial network observation.
The latter is achieved using a computational approach, 
which simulates a twofold ego-centric sampling design. 
The simulated sampling design is termed conventional 
because, unlike an adaptive design, it does not use infor- 
mation collected during the "survey," or earlier stages of 
the sampling process, to direct subsequent sampling [32]. 

The two stages making up the partial observation are 
node sampling and tie sampling. First, node sampling, 
for which the units of sampling are nodes, refers to the 
process of observing a fraction of the nodes, where, more- 
over, only ties that fall between the observed nodes are 
retained in the sample. Node sampling, sometimes also 
called node filtering, therefore corresponds to the idea of 
observing only a subset of the nodes. We use $f_n$ to denote
the fraction of unobserved nodes, such that $1 - f_n$ is the
fraction of observed or sampled nodes. The idea of node
sampling is similar to the study of random breakdowns
of networks in the context of percolation theory. Starting
with an initial degree distribution $P(k_0)$, the probability
that a node of degree $k_0$ becomes a node of degree $k$,
where $k \leq k_0$, is given by $\binom{k_0}{k} (1 - f_n)^k f_n^{k_0 - k}$, and the
new degree distribution [33] becomes

$$ P'(k) = \sum_{k_0 = k}^{\infty} P(k_0) \binom{k_0}{k} (1 - f_n)^k f_n^{k_0 - k}, $$   (2)

where the post-sampling quantities are denoted by a
prime. This leads to an average degree of $\langle k \rangle' = \langle k_0 \rangle (1 - f_n)$
in the sampled network.
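
The binomial thinning of the degree distribution in Eq. (2) and the resulting mean degree can be checked numerically. The sketch below is only an illustration under the assumption of a finite-support degree distribution stored as a list; the function names and the example distribution are hypothetical and not taken from the data.

```python
from math import comb


def thinned_degree_distribution(p0, f_n):
    """Post-sampling degree distribution P'(k) of Eq. (2).

    p0:  list with p0[k0] = P(k0), the degree distribution before node sampling.
    f_n: fraction of unobserved nodes.
    """
    k_max = len(p0) - 1
    keep = 1.0 - f_n  # probability that a given neighbor is observed
    return [
        sum(p0[k0] * comb(k0, k) * keep ** k * f_n ** (k0 - k)
            for k0 in range(k, k_max + 1))
        for k in range(k_max + 1)
    ]


def mean_degree(p):
    """Average degree of a degree distribution given as a list."""
    return sum(k * pk for k, pk in enumerate(p))


# Hypothetical initial distribution over degrees 0..3
p_initial = [0.0, 0.2, 0.5, 0.3]
p_sampled = thinned_degree_distribution(p_initial, f_n=0.2)
```

For this example the initial mean degree is 2.1, and the thinned distribution has mean degree 0.8 x 2.1 = 1.68, in agreement with the factor $(1 - f_n)$.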

Second, tie sampling, for which the units of sampling 
are network ties, refers to the idea that we typically ob- 
serve only some fraction of the contacts (neighbors) of 
each sampled node. It encapsulates the notion that hu- 
man subjects commonly do not disclose all of their social 
contacts, a problem that can be partially mitigated by 
suitable name generators, which are survey instruments 
used to solicit information from individuals about the 
people whom they are connected to [34-37].

For generality, we allow for arbitrary combinations of 
node and tie sampling. However, when combining the 
two, we always carry out node sampling first and tie sam- 
pling second, which is the order these two processes would 
occur in a real-world sampling situation. Note that when
combining the two sampling processes, the actual num- 
ber of ties removed in tie sampling is computed from the 
initial number of ties present in the network prior to node 
sampling. 

To clarify this, consider a network of N nodes and L
links. Since the sampled nodes are chosen uniformly at
random from the node population, any tie is included in
the sample if and only if both of its end nodes are included.
Since each node is included in the sample with probability
$(1 - f_n)$, on average a fraction $(1 - f_n)^2$ of ties in the
network will be included in the sample after node sampling.
For example, if $f_n = 0.2$, the expected number
of ties is 0.64L. If we subsequently apply tie sampling
using, say, $f_e = 0.2$, the expected number of ties falls
further to 0.64L - 0.2L = 0.44L. In other words, using
these sampling parameters, less than half of the ties
in the network would be present in the sample. In general,
as a consequence of the full (node and tie) sampling
process, the expected number of nodes in the sample is
$N' = (1 - f_n)N$, whereas the expected number of ties in
the sample is $L' = [(1 - f_n)^2 - f_e]L$.
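
The two-stage sampling can be sketched as follows. This is a simplified illustration rather than the procedure actually used: it assumes that each node is dropped independently with probability $f_n$, and that tie sampling removes $f_e L$ of the surviving ties uniformly at random (the text only states that the number removed is computed from the initial number of ties L). The function names are hypothetical.

```python
import random


def sample_network(nodes, edges, f_n, f_e, rng=None):
    """Node sampling followed by tie sampling; returns (kept_nodes, kept_edges)."""
    rng = rng or random.Random()
    nodes = list(nodes)
    edges = list(edges)

    # Stage 1: node sampling. A tie survives only if both end nodes are observed.
    kept_nodes = {v for v in nodes if rng.random() >= f_n}
    kept_edges = [(i, j) for i, j in edges if i in kept_nodes and j in kept_nodes]

    # Stage 2: tie sampling. The number of removed ties is a fraction f_e of the
    # initial number of ties, i.e. the count before node sampling.
    n_remove = min(len(kept_edges), round(f_e * len(edges)))
    removed = set(rng.sample(range(len(kept_edges)), n_remove))
    kept_edges = [e for idx, e in enumerate(kept_edges) if idx not in removed]
    return kept_nodes, kept_edges


def expected_counts(n_nodes, n_edges, f_n, f_e):
    """Expected N' = (1 - f_n) N and L' = [(1 - f_n)^2 - f_e] L."""
    return (1.0 - f_n) * n_nodes, ((1.0 - f_n) ** 2 - f_e) * n_edges
```

With $f_n = f_e = 0.2$, `expected_counts(100, 1000, 0.2, 0.2)` returns approximately 80 nodes and 440 ties, matching the 0.44L of the worked example.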



IV. SIMULATION RESULTS 

In this Section, we report results on three different 
types of inference. First, to what extent do path lengths 
$\ell_s$ in a partially observed or sampled network represent
path lengths $\ell_u$ in the underlying unsampled network?
Second, if it were possible to observe the network fully, 
how well would topological paths represent the actual 
(unobserved) dynamic paths as followed by the process? 
Third, if the network were to be only partially observed, 
how well do sampled topological paths represent the ac- 
tual (unobserved) dynamic paths followed by the pro- 
cess? Note that the first question is strictly topological, 
while the second and third questions are affected by both 
network topology and process dynamics. 

To quantify these biases, we define three bias factors, 
where the averages are taken over different process real- 
izations. First, the ratio of sampled path length to un- 
sampled path length as a function of actual path length 
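is denoted

$$ b_1(\ell) = \frac{\ell_s(\ell)}{\ell_u(\ell)} \geq 1, $$   (3)

since $\ell_s(\ell) \geq \ell_u(\ell)$ for all $\ell$; second, the ratio of unsampled
path length to the actual path length is

$$ b_2(\ell) = \frac{\ell_u(\ell)}{\ell} \leq 1, $$   (4)

since $\ell \geq \ell_u(\ell)$ for all $\ell$; and, third, the ratio of sampled
path length to the actual path length is

$$ b_3(\ell) = \frac{\ell_s(\ell)}{\ell} > 0, $$   (5)

but is otherwise unbounded. The corresponding averages
are

$$ b_1 = \langle b_1(\ell) \rangle, \quad b_2 = \langle b_2(\ell) \rangle, \quad b_3 = \langle b_3(\ell) \rangle, $$   (6)

where the averages are taken over a range of values for $\ell$.
For any network, $b_1 \geq 1$, $b_2 \leq 1$, and $b_3 > 0$.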
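
To make the three bias factors concrete, the sketch below computes them for a single source-target pair in a toy network that mirrors the situation of Fig. 1. It is an illustration only: in the paper the quantities are averages over realizations, whereas here a single actual path length is simply assumed, and the helper names (`adjacency`, `shortest_path_length`, `bias_factors`) and the toy edge lists are hypothetical.

```python
from collections import deque


def adjacency(edges):
    """Build a dict node -> list of neighbors from an undirected edge list."""
    nbrs = {}
    for i, j in edges:
        nbrs.setdefault(i, []).append(j)
        nbrs.setdefault(j, []).append(i)
    return nbrs


def shortest_path_length(neighbors, source, target):
    """Breadth-first search; returns None if the target cannot be reached."""
    if source not in neighbors or target not in neighbors:
        return None
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nb in neighbors[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return None


def bias_factors(ell, ell_u, ell_s):
    """Return (b1, b2, b3) = (ell_s / ell_u, ell_u / ell, ell_s / ell)."""
    return ell_s / ell_u, ell_u / ell, ell_s / ell


# Full network: a two-step route 0-1-2 and a three-step route 0-3-4-2
full_edges = [(0, 1), (1, 2), (0, 3), (3, 4), (4, 2)]
# Sampled network: node 1 is unobserved, so only the longer route survives
sampled_edges = [(0, 3), (3, 4), (4, 2)]

ell_u = shortest_path_length(adjacency(full_edges), 0, 2)
ell_s = shortest_path_length(adjacency(sampled_edges), 0, 2)
ell_actual = 3  # assumed: the infection happened to take the route 0-3-4-2
b1, b2, b3 = bias_factors(ell_actual, ell_u, ell_s)
```

Here $\ell_u = 2$ and $\ell_s = 3$, so $b_1 = 1.5$, $b_2 = 2/3$, and $b_3 = 1$: the sampled shortest path happens to recover the actual path length even though the unsampled shortest path underestimates it.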

In Figs. 3 and 4, we show the average path lengths
using unweighted neighbor selection for community networks
of ~2,000 nodes and ~20,000 nodes, respectively,
averaged over 1,000 attempted realizations (see discussion
below), for the sampled path lengths $\ell_s$, shown in
red, and the unsampled path lengths $\ell_u$, shown in blue,
as a function of the actual path length $\ell$ as followed
by the infection process. Since the process is run 1,000
times for each combination of sampling parameter values
$(f_n, f_e)$, each dot represents an average. To quantify the
extent of fluctuations around the average, we also compute
standard deviations, such that the plotted function
can be expressed as $\ell_u(\ell) \pm \sigma_u(\ell)$ for unsampled paths
and $\ell_s(\ell) \pm \sigma_s(\ell)$ for sampled paths, where $\sigma_u(\ell)$ and
$\sigma_s(\ell)$ are the corresponding standard deviations. Note
that the sampled path lengths $\ell_s$ are necessarily as long
as or longer than the unsampled path lengths $\ell_u$, meaning
that the red curve always lies on or above the blue
curve. The distance between the red and blue curves describes
the bias due to approximating the (unobserved)
shortest paths in the original network with the (sampled)
paths in the perturbed network and is quantified by $b_1$.
Note also that the blue curve is always on or below the
black line, consistent with the fact that the actual path
can never be shorter than the shortest path (by definition).
The distance between the blue curve and the black
line is the bias due to not having observed the spreading
process, but instead approximating it with shortest paths
computed in the (typically unobserved) original network.
This bias is quantified by $b_2$. Finally, the gap between the
red curve and the black line is the bias due to not having
observed the spreading process but, instead, approximating
it with sampled shortest paths, i.e., shortest paths computed
in the perturbed network. The extent of this bias
is quantified by $b_3$.

FIG. 3. (Color online) Observed path lengths (OPL) as a function of the actual path lengths (APL) in a medium-size community
of N ~ 2,000 nodes. Each panel corresponds to a different fraction of unobserved nodes and unobserved ties as indicated in
each panel by the $(f_n, f_e)$ pair. The blue curves correspond to the average unsampled path lengths $\ell_u$ and the red curves to the
average sampled path lengths $\ell_s$. The extent of fluctuations is indicated with the error bars, which are given as $\ell_u(\ell) \pm \sigma_u(\ell)$ for
unsampled paths and $\ell_s(\ell) \pm \sigma_s(\ell)$ for sampled paths. The diagonal dashed black lines correspond to the identity relationship,
i.e., points where the observed path lengths are identical to the actual path lengths. For this particular community, the sampled
path lengths are typically shorter than the actual path lengths.

FIG. 4. (Color online) Observed path lengths (OPL) as a function of the actual path lengths (APL) in a large community of
N ~ 20,000 nodes. For this particular community, the average sampled path lengths $\ell_s$ may be longer or shorter than the
actual path lengths, depending on the values of $f_n$ and $f_e$.

Depending on the network, the sampled paths may be
longer or shorter than the actual paths, but which of
these outcomes is more typical? For any of the 80 community
networks, and for any of the 49 unique $(f_n, f_e)$
sampling parameter combinations, and for any value of
the actual path length $\ell \in \{1, 2, 3, 4, 5, 6\}$, we obtain
1,000 attempted realizations for $\ell_u$ and $\ell_s$. We say attempted
because the more we sample, the thinner the
resulting network becomes, and consequently the smaller
the number of paths of any given length in the sampled
network. Under heavy sampling, it is possible that not
every realization contains a path of length, say, $\ell = 6$.
For this reason, when computing the mean path lengths
and the standard deviations, the statistics need to be
weighted. To accomplish this, let us first expand our
earlier notation slightly. We let $\ell_u(\ell, \eta)$ and $\ell_s(\ell, \eta)$ represent
the average path lengths, unsampled and sampled,
respectively, at distance $\ell$ for network $\eta$; similarly we let
$\sigma_u(\ell, \eta)$ and $\sigma_s(\ell, \eta)$ represent the corresponding standard
deviations of the path lengths; and finally $n_u(\ell, \eta)$
and $n_s(\ell, \eta)$ are the number of observations in each category,
which are less than or equal to $10^3$, the number
of attempted realizations in each category. The values of
$f_n$ and $f_e$ are considered fixed. The ensemble mean for
sampled paths is now given by

$$ \langle \ell_s(\ell) \rangle_\eta = \frac{1}{\sum_{\eta=1}^{n} n_s(\ell, \eta)} \sum_{\eta=1}^{n} n_s(\ell, \eta) \, \ell_s(\ell, \eta), $$   (7)

and the ensemble standard deviation is given by

$$ \langle \sigma_s(\ell) \rangle_\eta^2 = \frac{1}{\sum_{\eta=1}^{n} n_s(\ell, \eta)} \sum_{i=1}^{n} n_s(\ell, i) \, \sigma_s^2(\ell, i) + \frac{1}{\left( \sum_{\eta=1}^{n} n_s(\ell, \eta) \right)^2} \sum_{i=1}^{n} \sum_{j=i+1}^{n} n_s(\ell, i) \, n_s(\ell, j) \left[ \ell_s(\ell, i) - \ell_s(\ell, j) \right]^2, $$   (8)

where $\langle \ell_s(\ell) \rangle_\eta$ is simply a weighted mean of the means,
whereas $\langle \sigma_s(\ell) \rangle_\eta$ has two components, the former being
a weighted mean of the variances, and the latter being a
weighted mean of the squares of all pairwise differences of
the means. Both results follow from a direct calculation,
and the expressions for the unsampled paths are identical
and follow by changing the subscripts from s to u. We
show the plots of $\langle \ell_u(\ell) \rangle_\eta \pm \langle \sigma_u(\ell) \rangle_\eta$ and $\langle \ell_s(\ell) \rangle_\eta \pm \langle \sigma_s(\ell) \rangle_\eta$
for both unweighted and weighted neighbor selection in
Fig. 5. As expected, the average unsampled paths underestimate
the actual path lengths, and the extent of this
bias increases as $\ell$ increases. The sampled path lengths
may, however, overestimate or underestimate the actual
path lengths. While the averages behave very similarly,
there are significant differences in fluctuations between
the unweighted and weighted spreading processes. While
the weighted process in general shows more fluctuations,
the extent of fluctuations is especially pronounced for
sampled path lengths. In other words, the weighted
spreading process may veer the dynamic path even further
from the structurally shortest paths.

FIG. 5. (Color online) Observed path lengths (OPL) as a function of the actual path lengths (APL) averaged over 80 communities
using unweighted neighbor selection (8 panels on the left) and weighted neighbor selection (8 panels on the right).
For each type of neighbor selection, the leftmost column corresponds to node sampling only, where the value of $f_n$ is indicated
in the panel, and $f_e = 0$ for all panels (i.e., there is no edge sampling). The rightmost columns for each type of neighbor
selection correspond to edge sampling, where the value of $f_e$ is indicated in the panel, and $f_n = 0$ for all panels (i.e., there is
no node sampling). Shortest paths in partially observed networks typically overestimate the actual path lengths, the extent of
which depends on the sampling parameters as well as the length of the actual path $\ell$ taken by the process. Note that weighted
neighbor selection in the spreading process introduces considerable fluctuations, meaning that if the process is sensitive to tie
strengths, sampled topological paths reflect the actual process path length poorly, and may either significantly overestimate or
underestimate the path length.

The average outcomes are surprisingly similar for unweighted
and weighted neighbor selection, which could
have its origin in how the community networks are sampled
and the connection between network structure and
tie strength as quantified by the weak ties hypothesis
[17, 38]. To elaborate on this, we would expect community
networks to have a high density of ties, higher than
what would be expected by chance. The weak ties hypothesis,
on the other hand, states that there is a positive
association between the fraction of shared friends any two
connected individuals i and j have and the strength of
the tie $W_{ij}$ connecting them. This suggests that most ties
within communities would be expected to be fairly strong
and, consequently, the impact of incorporating weights
in the neighbor selection process might be fairly small.
However, as indicated above, the extent of fluctuations
is much greater for the weighted neighbor selection than
for the unweighted one.

In order to express the bias for all examined path
lengths $\ell = 1, \ldots, 6$, and over all 80 community networks,
we computed the conditional averages $\langle b_1 | f_n, f_e \rangle$,
$\langle b_2 | f_n, f_e \rangle$, and $\langle b_3 | f_n, f_e \rangle$, which quantify the overall bias
for given levels of node and tie sampling, and they are
shown in Fig. 6. The underlying numerical values are
given in Table I. For example, using $f_n = f_e = 0.2$,
which implies that after sampling 44% of ties remain in
the network, results in $\langle b_1 | f_n = 0.20, f_e = 0.20 \rangle = 1.50$,
showing that sampled paths are 50% longer than unsampled
paths for the given level of node and tie sampling;
$\langle b_2 | f_n = 0.20, f_e = 0.20 \rangle = 0.88$ shows that the unsampled
topological paths are 88% of the length of the
stochastic paths; and finally $\langle b_3 | f_n = 0.20, f_e = 0.20 \rangle = 1.32$
shows that sampled topological paths overestimate
path length by 32%.

The above averages, although informative, mask the
variation from one community network to another.
Therefore, instead of averaging over community networks,
we average, for each network, over the sampling parameters
$f_n$ and $f_e$. Figs. 7, 8, and 9 show the value of
this average bias plotted against network size N, number
of links L, link density $d = 2L/[N(N - 1)]$, and average
shortest path length $\langle \ell \rangle$ for all 80 subnetworks. To
three of the four plots in each figure, we fitted a linear
regression model of the form $\langle b \rangle = \beta_0 + \beta_1 \log(x)$, where
x is either N, L, or d. To gauge the goodness of fit of
the model, we used the simple (non-adjusted) $R^2$ statistic.
For each bias factor, $\langle b_1 \rangle$, $\langle b_2 \rangle$, and $\langle b_3 \rangle$, we find
that most variance is always explained by L (number of
links), then by N (number of nodes), and finally by d
(link density), although the three typically come close
to one another. The $R^2$ values using L as the predictor
of bias are 0.89, 0.63, and 0.88 for $\langle b_1 \rangle$, $\langle b_2 \rangle$, and $\langle b_3 \rangle$,
respectively, and the corresponding parameter values of
interest are $\beta_1 = 0.4826$ (for $\langle b_1 \rangle$), $\beta_1 = 0.0320$ (for $\langle b_2 \rangle$),
and $\beta_1 = 0.4698$ (for $\langle b_3 \rangle$). Consequently, of the three,
the value of $\langle b_2 \rangle$ is by far the least sensitive to variation
in L (or N or d; see Fig. 8).
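
A regression of the form $\langle b \rangle = \beta_0 + \beta_1 \log(x)$ and its non-adjusted $R^2$ can be reproduced with ordinary least squares. The sketch below is a minimal illustration, not the authors' analysis code: the function name `fit_log_linear` is hypothetical, the base-10 logarithm is an assumption inferred from the statement that a tenfold increase in L adds $\beta_1$ to the bias, and the data points are synthetic rather than taken from the paper.

```python
import math


def fit_log_linear(x, b):
    """Least-squares fit of b = beta0 + beta1 * log10(x).

    Returns (beta0, beta1, r_squared), with the non-adjusted R^2.
    """
    z = [math.log10(v) for v in x]
    n = len(z)
    z_mean = sum(z) / n
    b_mean = sum(b) / n
    s_zz = sum((zi - z_mean) ** 2 for zi in z)
    s_zb = sum((zi - z_mean) * (bi - b_mean) for zi, bi in zip(z, b))
    beta1 = s_zb / s_zz
    beta0 = b_mean - beta1 * z_mean
    ss_res = sum((bi - (beta0 + beta1 * zi)) ** 2 for zi, bi in zip(z, b))
    ss_tot = sum((bi - b_mean) ** 2 for bi in b)
    return beta0, beta1, 1.0 - ss_res / ss_tot


# Synthetic example: numbers of links and noise-free bias values on a line
links = [100, 1000, 10000, 100000]
bias = [0.5 + 0.47 * math.log10(v) for v in links]
beta0_hat, beta1_hat, r2 = fit_log_linear(links, bias)
```

Because the synthetic values lie exactly on the line, the fit recovers a slope of 0.47 with $R^2 = 1$; with this parameterization, each tenfold increase in x changes the predicted bias by $\beta_1$.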

FIG. 6. (Color online) Surface plots of sampling bias as a
function of $f_e$, the fraction of removed edges (horizontal axes)
and $f_n$, the fraction of removed nodes (vertical axes) for the
unweighted neighbor selection as described in Section III. (a)
Plot of $b_1$, the average ratio of sampled path lengths to
unsampled path lengths. (b) Plot of $b_2$, the average ratio of
unsampled path lengths to actual path lengths. (c) Plot
of $b_3$, the average ratio of sampled path lengths to actual
path lengths. The results are essentially identical for the
weighted neighbor selection.

1.31 1.36
1.34 1.39
1.36 1.39
1.36 1.35
1.31 1.17

TABLE I. Values of different bias ratios. Top panel:
$\langle b_1 | f_n, f_e \rangle$, the average ratio of sampled shortest path lengths
to unsampled shortest path lengths. Middle panel: $\langle b_2 | f_n, f_e \rangle$,
the average ratio of unsampled shortest path lengths to actual
path lengths. Bottom panel: $\langle b_3 | f_n, f_e \rangle$, the average ratio of
sampled shortest path lengths to actual path lengths. All
ratios are tabulated according to $f_n$ and $f_e$. The value in the
bottom right corner for $f_n = f_e = 0.30$ in the top and bottom
panels deviates from the trend present in the two tables. As
this value corresponds to the greatest degree of sampling
(both tables deal with sampled path lengths) and hence to
the smallest number of data points in the average, it is likely a
statistical fluctuation.

Therefore, using unsampled topological paths for
stochastic paths typically results in a fairly small overall
bias. The bias is always downwards, as expected, so the
resulting values of $\langle b_2 \rangle$ are always below
one. In contrast, using sampled topological paths for
stochastic paths may result in an upward or downward
bias, depending on the network and the sampling parameters,
such that $\langle b_3 \rangle$ may be less than one or more than
one. The extent of this bias is well predicted by the number
of links $L$ in the network, and the value $\beta_1 = 0.4698$
suggests that multiplying the number of links by a factor
of ten results in an addition of 0.47 to the value of
$\langle b_3 \rangle$. Of the studied 80 community networks, 35 had
$\langle b_3 \rangle$ less than one; based on the results of the
regression models, in particular the locations where the
regression lines meet the (horizontal) no-bias lines, these
networks typically have fewer than 3,500 nodes, fewer than
5,000 links, high link density ($d > 0.0008$), and average
shortest path length greater than 25. In other words,
compared to the population of studied community networks,
these tend to be small and relatively densely connected networks.
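
The tenfold statement can be unpacked with a short worked equation. This is a sketch that assumes, as the wording implies, that the regression is linear in $\log_{10} L$; the intercept symbol $\beta_0$ is introduced here for illustration only and its value is not reported in this section.

```latex
\begin{align}
  \langle b_3 \rangle &\approx \beta_0 + \beta_1 \log_{10} L,
      \qquad \beta_1 = 0.4698, \\
  \langle b_3 \rangle\big|_{10L} - \langle b_3 \rangle\big|_{L}
      &= \beta_1 \left[ \log_{10}(10L) - \log_{10} L \right]
       = \beta_1 \approx 0.47 .
\end{align}
```

Under the same assumption, the regression line meets the no-bias line $\langle b_3 \rangle = 1$ at $L^{*} = 10^{(1 - \beta_0)/\beta_1}$, so networks with $L < L^{*}$ are the ones for which sampled shortest paths underestimate the process path on average.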



V. DISCUSSION 

[Figure panels; horizontal axes: average shortest path length $\langle l \rangle$.]

FIG. 7. (Color online) Average bias-ratios $\langle b_1 \rangle$ (sampled path
length divided by unsampled path length) as a function of
different network characteristics for the studied 80 subnetworks.
The value of $\langle b_1 \rangle$ is always greater than one.

FIG. 8. (Color online) Average bias-ratios $\langle b_2 \rangle$ (unsampled
path length divided by dynamic path length) as a function of
different network characteristics. The value of $\langle b_2 \rangle$ is always
less than one.

FIG. 9. (Color online) Average bias-ratios $\langle b_3 \rangle$ (sampled path
length divided by dynamic path length) as a function of
different network characteristics. The value of $\langle b_3 \rangle$ may be less
than one or greater than one, depending on the community
network. The horizontal line corresponds to $\langle b_3 \rangle = 1$.

The last few years have seen a strong emphasis in the
literature on understanding structural properties of complex
networks, although increasingly the field appears to
be moving in the direction of network dynamics, where
dynamics can be understood both as dynamics of networks
and dynamics on networks. Spreading and diffusion
processes are the archetypes of dynamical processes
on networks. In this paper, we have explored
the connection between structural (topological) shortest
paths, which are elementary network characteristics and
on which other measures, such as betweenness centrality,
are based, and the lengths of certain types of functional
(dynamic) spreading paths. We have introduced the
additional layer of network sampling, which is relevant from
an empirical point of view but which, as we have seen,
typically complicates the relationship between structural
and functional paths.

More specifically, we have considered the properties 
of three different types of paths in social networks. In 
particular, we have compared their lengths under partial 
network observation, i.e. when there is sampling at the 
level of nodes, ties, or both. The paths we studied were: 
(1) the stochastic path taken by a spreading process from 
source to target, which is known in simulations; (2) the 
shortest path from source to target in a fully observed 
network; and (3) the shortest path from source to target 
in a partially observed network. 
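
The three path types can be made concrete with a minimal Python sketch. It is not the simulation used in this paper: the toy ring-lattice network, the uniform-random neighbor selection, the choice to keep the source and target observed, and all function names (`ring_lattice`, `process_path_length`, `sample_network`, `bias_ratios`) are assumptions made for illustration.

```python
import random
from collections import deque


def ring_lattice(n, k):
    """Toy connected graph: node u is linked to its k nearest neighbors on each side."""
    adj = {u: [] for u in range(n)}
    for u in range(n):
        for step in range(1, k + 1):
            v = (u + step) % n
            adj[u].append(v)
            adj[v].append(u)
    return adj


def shortest_path_length(adj, source, target):
    """Breadth-first search; returns the hop count, or None if target is unreachable."""
    if source not in adj or target not in adj:
        return None
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None


def process_path_length(adj, source, target, rng, max_steps=100_000):
    """Path (1): hops from source to target along the infection tree.

    Assumed toy dynamics: in each time step every infected node picks one
    neighbor uniformly at random and infects it if it is still susceptible.
    """
    depth = {source: 0}
    for _ in range(max_steps):
        if target in depth:
            return depth[target]
        newly_infected = {}
        for u in depth:
            v = rng.choice(adj[u])
            if v not in depth and v not in newly_infected:
                newly_infected[v] = depth[u] + 1
        depth.update(newly_infected)
    raise RuntimeError("target not reached; is the graph connected?")


def sample_network(adj, f_n, f_e, keep, rng):
    """Remove a fraction f_n of nodes (never those in `keep`), then a fraction f_e of edges."""
    candidates = [u for u in adj if u not in keep]
    removed = set(rng.sample(candidates, int(f_n * len(candidates))))
    edges = [(u, v) for u in adj for v in adj[u]
             if u < v and u not in removed and v not in removed]
    kept_edges = rng.sample(edges, len(edges) - int(f_e * len(edges)))
    sampled = {u: [] for u in adj if u not in removed}
    for u, v in kept_edges:
        sampled[u].append(v)
        sampled[v].append(u)
    return sampled


def bias_ratios(adj, source, target, f_n, f_e, rng):
    """Return (b1, b2, b3) for one source-target pair, or None if sampling disconnects them."""
    actual = process_path_length(adj, source, target, rng)        # path (1)
    unsampled = shortest_path_length(adj, source, target)         # path (2)
    observed = sample_network(adj, f_n, f_e, {source, target}, rng)
    sampled = shortest_path_length(observed, source, target)      # path (3)
    if sampled is None:
        return None
    return sampled / unsampled, unsampled / actual, sampled / actual


if __name__ == "__main__":
    rng = random.Random(1)
    network = ring_lattice(60, 3)
    print(bias_ratios(network, 0, 30, f_n=0.1, f_e=0.1, rng=rng))
```

Averaging the three returned ratios over many source-target pairs and many realizations would give toy analogues of the bias ratios discussed here; pairs that become disconnected in the sampled network are simply skipped in this sketch.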

Our findings counteract the naive intuition that sampling
will always inflate path lengths, in other words, the
notion that dealing with a partially observed network
would necessarily make processes seem to travel farther
than they actually do. The shortest path between any
two nodes in a partially observed network will, of course,
be as long as or longer than the shortest path between the
same nodes in a fully observed network. However, in
some cases, the upward bias caused by partial observation,
the extent of which depends on the structure of
the underlying network, can be offset by the tendency
of spreading processes to take non-optimal (longer than
shortest) paths, the extent of which depends on the
details of the spreading process. In some of the community
networks studied, the sampled path lengths were always
shorter than the actual path lengths, while in other
networks either could be shorter, depending on the extent
and nature (nodes vs. ties) of sampling. We found that
when averaged over all community networks, there were
more fluctuations present for the weighted process than
for the unweighted one. In particular, the fluctuations
were especially pronounced for sampled paths.
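
The offsetting argument can be written compactly for a single source-target pair. In this sketch the symbols $\ell_{\mathrm{a}}$ (actual process path length), $\ell_{\mathrm{u}}$ (unsampled shortest path length), and $\ell_{\mathrm{s}}$ (sampled shortest path length) are shorthand introduced here for illustration, and the pair is assumed to remain connected in the sampled network.

```latex
\begin{align}
  \ell_{\mathrm{s}} \ge \ell_{\mathrm{u}}
    \;&\Rightarrow\; b_1 = \frac{\ell_{\mathrm{s}}}{\ell_{\mathrm{u}}} \ge 1, \\
  \ell_{\mathrm{a}} \ge \ell_{\mathrm{u}}
    \;&\Rightarrow\; b_2 = \frac{\ell_{\mathrm{u}}}{\ell_{\mathrm{a}}} \le 1, \\
  b_3 = \frac{\ell_{\mathrm{s}}}{\ell_{\mathrm{a}}} &= b_1 \, b_2 ,
  \qquad b_3 < 1 \iff b_1 < \frac{1}{b_2} .
\end{align}
```

The sign of the net bias therefore depends on which factor dominates: the inflation from sampling ($b_1$) or the detour taken by the process ($1/b_2$). The product identity holds pair by pair; the averaged ratios need not satisfy $\langle b_3 \rangle = \langle b_1 \rangle \langle b_2 \rangle$.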

Since social networks are almost never fully observed,
even if some facet of them might be, such as electronic
communication records under ideal circumstances, it is
important to understand the impact of sampling on path
lengths, and this understanding is likely to find many applications. For
example, in a recent study, in addition to epidemiolog- 
ical and genomic data, Gardy and coauthors used a so- 
cial network constructed from patient interviews to deter- 
mine the origin and transmission dynamics of a tubercu- 
losis outbreak [16]. Traditional contact tracing (the iden- 
tification and diagnosis of persons who may have come 
into contact with an infected person) did not identify a 
probable source. However, the structure of the elicited 
social network suggested "the most likely source" of the 
epidemic. Although it is not clear how the source was 
identified, it was likely inferred from (partially observed) 
topological shortest paths. 

Another recent study, by Rocha, Liljeros, and Holme,
examined a network of alleged offline sexual contacts
between anonymous escorts and sex buyers as self-reported
by both parties in an online community [39].
Approximately 71% of the individuals in the largest connected
component were reachable by following the time-ordering
of the contacts, suggesting that a majority of the
component was connected in a way that would allow
sexually transmitted diseases to spread between its
members [39]. In this case, time-ordered data were available,
which strongly limits the possible spreading paths, given
that the contacts need to happen in a certain temporal
sequence to potentially transmit a harmful virus or
bacterium. Nevertheless, the system is a sample of the
underlying population, since the buyers and sellers could
be sexually active with individuals who are not members of the
online community. If one were to calculate, for example,
how far a given strain of HIV could have travelled
and, hence, how many individuals might have been
exposed to it, misestimating the path lengths might lead to
misestimates of the size of the epidemic.
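
The constraint imposed by time-ordering can be illustrated with a small Python sketch. It is not the method of the cited study: the function name `time_respecting_reachable`, the `(time, u, v)` contact format, and the rule that an infection can only be passed on at a strictly later time are all assumptions made for illustration.

```python
def time_respecting_reachable(contacts, source):
    """Return the set of nodes reachable from `source` via time-ordered contacts.

    `contacts` is an iterable of (time, u, v) tuples describing undirected
    contacts. A node infected at time t can transmit only in contacts that
    occur strictly after t (an assumption of this sketch).
    """
    infected_at = {source: float("-inf")}
    for t, u, v in sorted(contacts, key=lambda contact: contact[0]):
        for giver, receiver in ((u, v), (v, u)):
            if (giver in infected_at and infected_at[giver] < t
                    and receiver not in infected_at):
                infected_at[receiver] = t
    return set(infected_at)


if __name__ == "__main__":
    # The static network a-b-c-d is connected, but the c-d contact
    # happens before c can be reached from a.
    contacts = [(1, "a", "b"), (2, "b", "c"), (1, "c", "d")]
    print(sorted(time_respecting_reachable(contacts, "a")))
```

In the toy data the static network is a connected chain, yet only part of it is reachable from node `a` once the order of contacts is respected, which is the sense in which time-ordered data limit the possible spreading paths.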

There are three obvious ways to extend our work.
First, one could vary the structure of the underlying
network; the results are expected to vary significantly as the
topology of this platform is varied. Second, the
details of the spreading process could be modified
to be more realistic, and could be tailored towards
specific illnesses. Further, to study the spread of behaviors
and norms, it might be fruitful to include ideas from
the growing literature on complex contagions [5, 30, 31].
Third, in our sampling scheme, the units of sampling
were either nodes, ties, or both nodes and ties, but one
could study the phenomenon for more realistic sampling
designs, such as respondent-driven sampling (RDS), used
to study small but important hard-to-reach populations,
such as injection drug users [40]. Finally, although we
have framed the problem in the context of social networks,
the concepts are generic, and they could be applied
to any type of network for which an understanding
of the permeation depths of dynamic processes in sampled
data is important.